SparkGC: Spark based genome compression for large collections of genomes

Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of sequencing data. One of the consequences is that it becomes extremely difficult to store, backup, and migrate enormous amount of genomic datasets, not to mention they continue to expand as the cost of sequencing decreases. Herein, a much more efficient and scalable program to perform genome compression is required urgently. In this manuscript, we propose a new Apache Spark based Genome Compression method called SparkGC that can run efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark’s in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that the compression ratio of SparkGC is better than the best state-of-the-art methods, at least better by 30%. The compression speed is also at least 3.8 times that of the best state-of-the-art methods on only one worker node and scales quite well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://github.com/haichangyao/SparkGC. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04825-5.

thereby their compression ratio is limited. No matter what general-purpose compression method is used, the compression ratio can only reach 7:1 at most, which cannot completely resolve the challenge [4]. In recent years, researchers have proposed many special-purpose genome compression methods. Compared with the general-purpose compression methods, their compression ratio has been greatly improved [5].
As a data compression method, the compression ratio is the first evaluated factor in most situations, so this is the direction that genome compression researchers have been working on. But in the face of big data, the compression time is gradually emerging to be an urgent problem for researchers to resolve [6]. The compression method is always a trade-off between compression ratio and compression speed. The latest research results show that the compression ratio of pair-wise human genomes compressing has increased to an average of more than 300:1, but the compression time has also increased to more than 10 min per person [7]. The time cost may be tolerable when compressing a small amount of genomic data, such as 100 or 1 K human genomes. However, in the scenario of data archiving or data migration in the genomic data centers, it is very common to compress 10K, 100K, or even 1 million human genomes. At this time, the compression time required by current genome compression methods becomes intolerable.
Another problem is that with the continuous increase of genomic data, cloud storage, the storage mode specially designed for big data, has gradually entered the bioinformatics community [8]. Cloud storage enables users to use storage facilities on demand without the huge cost of building and maintaining expensive infrastructure. With recent services by Amazon and alike, it is possible to rent almost arbitrary configurations, which look logically as a single machine, but is in fact distributed. In this setup, when storing the genomic big data, the user does not need to care about using the cloud, since the infrastructure is hidden from the user. Although most genome compression methods can still be used in the cloud, they are almost single machine algorithms, which cannot fully utilize the computing power of distributed nodes. Developing a genome compression method that can be executed directly in distributed parallel systems has become a better solution [9].
Another advantage of studying parallel and scalable genome compression algorithms is that they can implement more complex compression schemes with high space-time requirements. The key factors of referential genome compression algorithms have been developed from maximum exact match (MEM) search to the prediction or calculation of precise differences between sequences. Solving the differences between two sequences is a global optimization problem, and solving the differences among large collections of sequences is an approximate NP-hard problem. They both have high computational complexity [6]. Distributed parallel algorithms can speed up these solvings that make the research of genome compression have more space.
In recent years, several distributed parallel frameworks have emerged to efficiently manage and process large datasets. The most popular of which are Hadoop [10] and Spark [11]. Both of them are open source big data frameworks of Apache. The core designs of the Hadoop framework are Hadoop Distributed File System (HDFS) and MapReduce. HDFS provides storage for massive data, while MapReduce provides the calculation for massive data. The MapReduce framework has a limitation on programmability though, as it requires the programmer to write code where the Map phase is always followed by the Reduce phase. Moreover, it saves intermediate data to the disk between Map phase and Reduce phase, which increases disk access overhead. Spark is a general parallel framework like MapReduce. It is open source by UC Berkeley AMP lab. Spark has the advantages of MapReduce and can also use HDFS as the distributed file system. But different from MapReduce, Spark allows programmers to perform many other transformations besides just Map and Reduce, while keeping data in memory between these transformations. These distributed parallel frameworks are different from the multi-core parallel schemes, which can be realized by modifying the code slightly. Only if the architectural details and the specific aspects of the considered framework are carefully taken into account for the algorithm design and implementation, genome compression can be developed well on these frameworks.
In this manuscript, we propose and implement a Spark based genome compression (SparkGC) method that allows running efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC utilizes Spark's inmemory computation capabilities to improve the performance of genome compression. The contributions of this manuscript are summarized as follows: • We proposed Spark based genome compression scheme for the first time. Although there are some genome compression methods based on distributed parallel computing, such as FastDRC [12], they are all based on MapReduce framework. To our best knowledge, there is no large collections of genomes compression method based on Spark. Our method proved the feasibility of compressing large collections of genomes via Spark. Furthermore, our research results indicated that Spark is more suitable for the iterative compression of large collections of genomes. • We designed and implemented SparkGC: a production-quality and highly scalable large collections of genomes compression method. SparkGC meticulously designs the RDD (Resilient Distributed Datasets) transformations to keep data active in memory among the whole compression process. SparkGC improves the compression speed for about 4 times on the cluster with only one worker node and scales excellently by increasing the number of worker nodes. • We optimized the framework by using Kyro serialization and broadcast variables compression that enable SparkGC to compress 1100 human genomes on a common computer with just 24 GB of RAM. We optimized the encoding scheme for the mapping results that makes SparkGC achieve the best compression ratio among all the state-of-the-art compression methods.
The remainder of this manuscript is organized as follows. In "Related works" section, we discuss the related works on genome compression and its parallelization. "Methodology" section presents the methodology of SparkGC. This is followed by "Results" section, which evaluates the performance of the proposed algorithms, including compression ratio, compression speed, scalability, robustness, and trade-off. We finally conclude the manuscript with "Conclusions" section.

Related works
DNAZip [13] proposed in 2009 compressed James Watson's 3 GB genome to 4 MB, so small, that it even can be sent by email attachment. The high compression ratio of the referential genome compression immediately aroused researchers' interest, made more and more researchers focus on referential genome compression research and obtain many achievements. Table 1 summarises the related works of this paper. More works about genome compression can be referenced in review articles [14][15][16].
In recent decade, the performance of referential genome compression method continues improving, including compression ratio, robustness, scalability and applicability. The object of compression is also extended from single sequence to large collections of sequences. Researchers improve the compression method from every stage of compression process, such as sequence pre-processing, reference selection, index building, matching scheme, parallel scheme, etc. Specifically, in terms of sequence pre-processing, earlier proposed methods, such as FRESCO [17] and COGI [18], converted all characters to lowercases or uppercases and treated all nonbase characters as 'n' . The pre-processing scheme reduces the matching complexity, but losses some information. Recent proposed methods basically preserved all information of target sequences, that is, achieved lossless compression. In terms of reference selection, COGI uses the technology based on co-occurrence and multiscale entropy. FRESCO uses the technology based on RsbitX. RCC [19] and ECC [6] cluster the target genomes, and choose the centroid of each cluster as the reference sequence. In terms of index building, researchers employed different technologies (e.g. suffix tree, suffix array, hash array, compressed suffix tree, etc.) to adapt different scenarios. In terms of matching scheme, researchers proposed greedy matching, segmentation matching and approximate matching schemes. Due to the improving technologies and schemes, the compression ratio is getting better and better, from dozens [20] to more than 400:1 [6]. When the second-order compression scheme is employed, the compression ratio achieves even more than 2000:1 [21]. However, the cost is the increasing time complexity. With the exponential increase of genomic data, intolerable compression time emerges to be a problem that compression researchers have to work hard to resolve. Therefore, in order to reduce compression time, some researchers started to employ parallel technology. But the most used parallel technology is the straightforward multithreaded parallel technology. With the development of the big data processing technology, some Hadoop based genome compression method are proposed [22].But generally speaking, the research based on big data processing technologies still has a lot of work to be done. So far, to our best knowledge, there is no published research on Spark based genome compression, but only some Spark based genome analysis achievements [23].

Methodology
This section will firstly introduce the architecture of SparkGC, and then describe separately how does SparkGC parallelize the compression tasks, and finally introduce the decompression.

Architecture
The architecture for large collections of genomes compression based on Spark is shown in Fig. 1. Spark architecture divides the computing cluster to master nodes and worker nodes. The compression algorithm is deployed on the master node, but the scheduling mechanism of Spark is migrating the computing tasks to nodes closest to the data, so the compression tasks will be scheduled to worker nodes. The Driver component of Spark executes the main function of genome compression in SparkGC. TaskScheduler Hadoop component partitions the compression tasks and schedules them to each executor. Executor is a process running on worker node to execute compression tasks and cache intermediate results of RDD transformation. The master node reads the reference sequence from HDFS or local file system, and builds the index of the reference sequence. The driver broadcasts the reference sequence and its index as broadcast variables. The executor stores the broadcast variables to the BlockManager component. Broadcast variable is a shared variable mechanism of Spark. It enables the programs to send large size readonly data to all worker nodes. The worker nodes read the to-be-compressed sequences from HDFS before the compression and write the compressed results to HDFS after compression.

Pre-processing
The data flow of SparkGC is shown in Fig. 2. After each sequence file is read into memory, pre-processing of the sequence file follows closely. The sequence file is divided into two parts, base data, and auxiliary data. Base data refers to the base sequence composed of uppercase ACGT, and auxiliary data is the identifier, line break characters, special characters, and other information contained in the sequence. Because SparkGC is a lossless genome compression method, the auxiliary data of the to-be-compressed sequence cannot be lost. They are compressed independently with specific coding schemes at the pre-processing stage. The base data of the to-be-compressed sequence file saved as RDD to the memory of worker nodes for compression tasks. It is the first RDD of SparkGC It is used as a whole, but actually the data of RDD are distributed in the memory of different worker nodes. The base data of one to-be-compressed sequence is one partition of RDD1. Each partition maps to a processing thread, which ensures that the process of these partitions is independent and concurrent. The hash index built for the reference sequence data is on the master node. However, all worker nodes require reference sequence data and its hash index, so broadcast variables need to be created for them. In order to reduce the size of broadcast variables, Kryo [32] is used for serialization and compression of broadcast variables. The broadcast variables are set to read-only, which will not cause thread-safety problems. Then the broadcast variables are sent and cached in the memory of each worker node to prepare for the compression stage.

Compression
The compression stage of SparkGC contains two steps which are referred to as the firstorder compression and the second-order compression. The main task of the first-order compression is mapping the to-be-compressed sequences to the reference sequence  based on the hash index, that generates the MEMs. The MEMs obtained at this stage are encoded as the tuple < position, length >. The mismatched sequence data is stored as the original characters. Therefore, after the first-level mapping, the original sequence data is converted to a new sequence composed of triple < position, length, mismatched string>. All the mapped results are not saved to the file system, but saved in the memory for the second-order compression. They are the second RDD of the data flow of SparkGC, i.e., RDD2. The mapped results of one to-be-compressed sequence are saved as one partition of RDD2. The partition is represented as < key, value >, where the key is the partition number and the value is the mapped results. The partition number is used to identify the sequence ID, so as to ensure the sequence order in subsequent processing. If the data amount of the value exceeds the memory limitation of the worker node, Kryo serialization will be used again to compress the data to prevent compression failure because of insufficient memory, that makes the compression of large collections of genomes successful on an ordinary computer.
After the first-order compression, a part of the compressed sequences will be regarded as the references of the second-order compression, hereinafter they are referred to as the second-order references. The second-order references are filtered out according to the sequence ID. Then, the hash index for the second-order references is built. The secondorder references and their hash indexes are cached in memory as RDD3. The partitions of RDD3 are distributed on different worker nodes. Because like the first-order compression, all worker nodes need to use the second-order references and their hash indexes at the second-order compression stage. Therefore, all partitions in RDD3 are merged into one partition, i.e., RDD4. The master node then creates the broadcast variable based on RDD4 and sends it to all worker nodes. The merging will generate shuffle, that results in network transmission and disk access. However, after the first-order compression, the size of the compressed sequence has been reduced by more than 100 times compared to the original size, so the amount of shuffle data is not large. The first-order compressed results RDD2 is also kept in memory for the second-level matching.
The essence of the second-level matching is mapping the first-order compressed sequences to the second-order references by order using the hash indexes. The firstorder compressed sequences are read from RDD2. The second-order references and their hash indexes are read from broadcast variables. After the second-level matching, the original first-level mapped results are converted to the second-level mapped results composed of triples < sequence ID, position, length> and triples < position, length, mismatched string>. Lastly, the second-order references are also compressed. The i-th second-order reference is mapped with the first to the (i-1)-th second-order references. In this way the second-order references can be losslessly reconstructed.
After the second-level mapping, all the mapped results are written to HDFS, one file for one sequence. These files will be downloaded to the local file system, compressed by BSC compression algorithm (http:// libbsc. com/). So far, the whole compression is completed. SparkGC compression algorithm is summarized in Algorithm 1. Yao et al. BMC Bioinformatics (2022) 23:297 Decompression Firstly, the compressed file is decompressed by BSC decompression algorithm to get all the second-order compressed data. The reference sequence is read and extracted in the same way as compression. But unlike compression, there is no need to build any hash index in decompression. So decompression and compression is asymmetric. Decompression requires much less memory and time. The target sequences are read by order, their base data and auxiliary data are reconstructed respectively. It is worth to note that SparkGC supports decompressing the sequences interested without decompressing all the sequences every time. The overview of the decompression is shown in Fig. 3. If the target sequence is one of the second-order references, the i-th target sequence only depends on its previous i-1 sequences. The target sequence can be totally decompressed without decompressing the remainder sequences. If the target sequence is not one of the second-order references, its decompression only depends on all the second-order Fig. 3 Overview of the decompression of SparkGC references, other sequences don't need to be decompressed. For large genomic datasets, this will save decompressors a lot of time.
Because more than 95% of decompression time is I/O time, the actual limitation of the decompression is the hard disk write speed. That is different from the compression which is CPU-bound and memory-bound. It has no effect to parallelize the decompression. Therefore, SparkGC does not implement the parallelization of decompression.

Results
We evaluate the performance of SparkGC in this section. SparkGC was run on the cluster with 4 worker nodes and 1 master node. Each node is a common computer configured with 2 × 2.8 GHz Intel Xeon E5-2680 (20 cores) and 32 GB RAM. SparkGC was run over YARN (Yet Another Resource Negotiator) in the platform.
The datasets we selected firstly were 1000 Genome Project [33] which contains 1092 human genomes. In addition, we supplemented another 10 genomes HG13, HG16, HG17, HG18, HG19, HG38, K131 (the abbreviation of KOREF_20090131), K224 (the abbreviation of KOREF_20090224) [34], YH [35], and Huref [36]. These 10 human genomes are derived from different sequencing teams using different methods in different periods. They have different characteristics so that they are widely used in genome compression algorithms evaluation [7,29,30]. Therefore, our datasets totally contain 1102 human genomes and the total file size is about 3.11 TB. All the datasets can be downloaded from open access FTP server. Details of these datasets are provided in the Additional file 1.

Compression ratio
As a compression method, the compression ratio is always the first factor to be evaluated. Firstly we arbitrarily selected HG16 as the reference to compress other 1100 human genomes. In the robustness section, we evaluated the compression performance under different references. In addition, we also tested the compression ratio and compression time of the state-of-the-art genome compression methods in recent 4 years. They compressed the same 1100 human genomes under the same reference run on the same computers. These compression methods are HiRGC [29] proposed in 2017, SCCG [30] proposed in 2018, HRCM [21] proposed in 2019, and memRGC [7] proposed in 2020. Their compression ratios are shown in Fig. 4. Details of the experimental results are provided in Additional file 1: Table S3.
We can see from the Fig. 4, SparkGC achieved the best compression ratio among all the compression methods. The compression ratio of SparkGC is about 2347:1, it compressed the 3.11 TB original data to about 1387 MB. Compared to HiRGC, SCCG, and memRGC, the compression ratio is improved by 673%, 653%, and 582%. The compression ratio of SparkGC is greatly improved mainly because of the second-order compression scheme. After the first-level matching of each to-be-compressed sequence to the reference sequence, the matched results are not written to file, but saved in memory as intermediate data. Part of these intermediate sequences are selected as the second-order references to build hash index, then each first-order compressed sequence is compressed again according to the second-order hash index matrix. This compression scheme fully utilizes the similarity among the to-be-compressed sequences, greatly reduces the size of the compressed file. HRCM is also a second-order compression method, but compared to HRCM, the compression ratio of SparkGC is improved by 31%. The reason is that SparkGC uses BSC compression algorithm to compress the second-order compressed sequences.

Compression speed
The compression speed of compressing 1100 human genomes using HG16 as the reference sequence is shown in Fig. 5. Details of the experimental results are provided in Additional file 1: Table S4. Because SparkGC is a distributed parallel method, it will distribute the compression tasks on multiple nodes and run at the same time. In order to be as fair as possible, in this experiment, the number of worker nodes of SparkGC was set as 1, that is, SparkGC only used one worker node to perform the compression. The compression speed of multiple nodes will be illustrated in the scalability section. All compression time of this paper corresponds to the 'real' or wall-clock elapsed time. Each experiment was executed 3 times and the average time was recorded. As can be seen from Fig. 5, although SparkGC only used one worker node to execute the compression, the compression speed achieved more than 58 MB/s, which is much higher than the best state-of-the-art methods. It only took 15.53 h to complete the compression of 3.11 TB genomic data. The compression speed is 5.63 times of HiRGC, 21.75 times of SCCG, 11.21 times of memRGC, and 3.85 times of HRCM. SCCG takes the most time, more than 14 days. It is hard to tolerate so much time to compress 1100 human genomes. SparkGC reduced the compression time of several days required by other methods to just more than half a day. The reason why SparkGC can achieve such high speed on one node is that the algorithm will make full use of the multi-thread of a single node automatically for compressing.

Scalability
The biggest advantage of SparkGC is not the performance on a single node, but its high scalability, which is the advantage that other methods do not have. We did a series of experiments to evaluate the scalability of SparkGC. Firstly we evaluated the compression speed of all chromosomes on the cluster with an increasing number of worker nodes activated, ranging from 1 to 4. The compression ratio of SparkGC does not correlate with the number of worker nodes, so the increasing number of worker nodes will not change the compression ratio. The total compression time of all chromosomes under different numbers of worker nodes is shown in Fig. 6. Here we observe that, with the increasing number of the worker nodes, the compression time decreases greatly. When the number of worker nodes is 4, SparkGC was able to compress the 3.11 TB genomic data to about 1387 MB in less than 6 h. The compression speed is about 151 MB/s. In terms of runtime and parallelism, the following experiment evaluated the compression process of SparkGC from four stages: In these four stages, the first stage and the last stage cannot be parallelized, they are run only on the master node. Only the second stage and the third stage can be parallelized. Therefore, to evaluate the change of runtime of different stages with the increasing number of worker nodes, we illustrate the compression time of Chromosome 1 (abbreviate as Chr1) and Chromosome 13 (abbreviate as Chr13) in detail, as shown in Table 2.
It can be seen from Table 2 that the pre-processing stage and the post-processing stage did not save time with the increase of worker nodes, on the contrary, their runtime increased with the increase of worker nodes. Because with the increase of worker nodes, the program needs to initialize all worker nodes, and broadcast variables need to be sent to all worker nodes, which increases the runtime. Similarly, at the postprocessing stage, the increase of worker nodes will lead to a longer clean-up time.
Observing the first-order compression stage and the second-order compression stage will find that these two stages scaled quite well. For example, to Chr1, when the number of worker nodes was 1, 2, and 4, the average first-order compression time of each chromosome was 5, 2.4, and 1.5 s respectively, and the average second-order compression time of each chromosome was 0.48, 0.23, and 0.12 s respectively. To Chr13, it was 1.38, 0.83, and 0.46 s respectively at the first-order compression stage and 0.11, 0.06, and 0.03 s respectively at the second-order compression stage. The runtime was almost decreasing at the linear speed. From the percentage of runtime at each stage, the percentage of serial computing was gradually increasing, while the percentage of parallel computing was gradually decreasing, so the overall runtime showed a sublinear downward trend.
The above experiments are all compressing 1100 human genomes. We are very interested in how the compression ratio and compression speed of SparkGC scale with the number of the to-be-compressed sequences. Because in the actual compression scenario, SparkGC will compress any size of genomic data sets. We evaluated the compression ratio and compression speed of Chr1 and Chr13 when the sequence number was 200, 400, 600, 800, and 1000 respectively, as shown in Fig. 7. In this experiment, the number of worker nodes was 3.
We can see from Fig. 7 that SparkGC also scaled quite well to the number of the tobe-compressed sequences. The compression ratio and compression speed of Chr1 and Chr13 both increased with the increase of the to-be-compressed sequences. The compression ratio of Chr1 gradually increased from 1756:1 to 2390:1, and that of Chr13 increased from 2085:1 to 2871:1. The compression speed of Chr1 increased gradually from 65 MB/s to 106 MB/s, and that of Chr13 increased from 100 MB/s to 139 MB/s. The compression ratio of SparkGC increases with the increase of the to-be-compressed sequences is because in that case, the percentage of the second-order references decreases. The default number of the second-order references of SparkGC is 40. The compression ratio of the second-order references is low, because the i-th second-order reference is compressed only using the first to the (i-1)-th sequences as references. So the smaller i is, the smaller the compression ratio is. The compression speed of SparkGC also increases with the increase of the to-be-compressed sequences, because the percentage of the first-order compression time and the second-order compression time increases. The compression speed and compression ratio of Chr13 is higher than Chr1 is because of the different similarity of different chromosomes. Generally, the greater the similarity of chromosomes, the higher the compression ratio and the faster the compression speed [37].

Robustness
The compression ratio of the referential compression method is easily affected by the reference sequence. Therefore, some researchers studied the reference selection [6] [17] [19]. However, with the increase of the to-be-compressed sequences, the selection of reference  Tables 3 and 4 respectively. In the two tables, AVG represents the average value under different references, it is computed by (1).
where s i represents the compressed file size, and the n value is the number of references. SD is the standard deviation of all values, represents the degree of dispersion. It is computed by (2).  times that of the minimum compressed size, respectively. When HG13 was the reference sequence, SCCG even failed to compress all sequences. The reason why SparkGC is less affected by the reference sequence is that if the similarity between the reference sequence and the to-be-compressed sequence is low, many identical mismatched fragments will be generated after the first-level matching, and these mismatched fragments will be matched and compressed in the second-level matching, so the compression result has little relationship with the similarity between the reference sequence and the to-becompressed sequence. HRCM is also one of the second-order compression methods, so HRCM also performs well in robustness.
As can be seen from Table 4, SparkGC performed better in the robustness of compression time. The SD values of Chr1 and Chr13 are only 0.1 and 0.01 respectively, which are far lower than other compression methods. In terms of compression time, the maximum and smallest minimum compression time of Chr1 are 42 min and 24 min respectively; the maximum and minimum compression time of Chr13 are 20 min and 9 min respectively, and the difference is very small.

Discussion
Data compression is always a trade-off between compression ratio and compression speed, so is SparkGC. When the reference sequence is determined, the most important factor affecting compression ratio and compression speed is the number of the secondorder references. In the second-level matching, the more the second-order references are, the greater the probability of matching successfully, so the compression ratio is higher. However, the cost is that the shuffle time and the hash index building time of the second-order references, the transferring time of broadcast variables, and the search time in hash index will be longer. We evaluated the trade-off between compression ratio and compression speed of SparkGC under 7 different numbers of the second-order references, as shown in Fig. 8. In this experiment, we evenly selected 8 chromosomes of our datasets for compressing. They are Chr1, Chr4, Chr7, Chr10, Chr13, Chr16, Chr19, and Chr22. The total file disk size of these chromosomes is about 1 TB.
We can see from the figure, with the increase of the number of second-order references, the compressed size decreases, and the compression time increases. The compressed size decreases from 611 MB when the number of second-order references is 10 to 460 MB when the number of second-order references is 70, the reduction rate is 24.7%. However, the compression time increases from 116 to 133 min, with an increase of 14.7%. Therefore, the compressors can choose the appropriate number of the secondorder references according to their own needs.
In order to expand the applicability of the method, we developed sub-modules to compress FASTQ sequence based on the proposed methodology. Furthermore, we used data sets generated by different sequencing technologies including new and traditional ones to evaluate the FASTQ compression modules. The sequencing technologies we selected are Illumina, PacBio, and Oxford Nanopore. Details of the data sets are provided in Additional file 1: Table S2. The compression ratio and speed are shown in Fig. 9.
From Fig. 9, it is shown that SparkGC successfully work on all test data sets. However, the performance changes with different sequencing technologies. It obtains superior performance on the second generation sequencing platform Illumina to the third generation sequencing platforms PacBio and Oxford Nanopore. The reason is that the third generation sequencing technologies obtain longer read length, but which is accompanied by a relatively higher error rate. The error bases result in more fragments when matching the reference sequence, which affects the compression performance. In addition, the unfixed read length value also consumes a certain amount of storage space.
The evaluation of a genome compression method must take into account the main memory usage. Compared with the stand-alone programs, it is more complex to discuss the memory usage of SparkGC. Because the tasks of the master node and each worker node are different, the memory usage is different. The master node is  Compression speed (MB/s) 15.71 Fig. 9 Compression ratio and speed on FASTQ data sets responsible for reading reference sequences, the aggregation of the first-order compression results, and the hash index building. The memory footprint of the master node is affected by the size of the reference sequence and the number of secondorder references. The worker node is responsible for reading the to-be-compressed sequences, the first-order compression, and the second-order compression. The memory usage is related to the number of to-be-compressed sequences. From our experimental observation, compressing 1100 human genomes consumes the most memory. However, whether on the master node or worker node, the memory footprint of SparkGC is less than 20 GB.

Conclusions
This research proposes and implements a genome compression method based on Apache Spark. It can run efficiently on a multi-node cluster to compress large collections of genomes. Compared to the state-of-the-art genome compression methods, the compression ratio and speed are both recognizably improved. Besides, the method has good scalability and robustness. It will greatly benefit the storage of large genomic datasets. However, it should be noted that developing Spark based programs is not a trivial task. As such, they have largely only been embraced in the technology sector. Making Spark based genome compression method easy to use and extend for more non-computer science professionals is our goal at the next stage.
Additional file 1. Details of the data sets and experimental results.